In the following, I’ll illustrate how to load the raw data and calculate various metrics

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'readr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## Warning: package 'forcats' was built under R version 3.6.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.6.2
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(scales)
## Warning: package 'scales' was built under R version 3.6.2
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(RcppRoll)

The following will of course be specific to your environment

data<-read_csv("./output/tweetjson006_annotated_tweets.csv")
## Warning: Missing column names filled in: 'X1' [1]
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   tweet_created_at = col_datetime(format = ""),
##   tweet_text = col_character(),
##   tweet_entities = col_character(),
##   tweet_public_metrics = col_character(),
##   tweet_referenced_tweets_id = col_character(),
##   tweet_referenced_tweets_type = col_character(),
##   date_floor = col_datetime(format = "")
## )
## ℹ Use `spec()` for the full column specifications.
data

Note that all topic assignments are “topic_X” comments

First, visualizing at different levels of aggregation. My strategy here is to select the data that I want to manipuate, then pivot, then aggregate. I’m going to write this code using functions to make things more modular, but you don’t have to.

Prepping the data

The strategy here is to summarize by day in each column, join with a date sequence, pivot, then fill NAs.

prep_data<-function(d) {
  d %>% mutate(day = as.Date(floor_date(tweet_created_at,unit="day"))) %>% group_by(day) %>% summarise_at(vars(starts_with("topic_")),sum)->d
  seq_dates<-tibble(day = seq.Date(from=min(d$day),to=max(d$day),by="day"))
  d <- d %>% right_join(seq_dates) %>% select(day, starts_with("topic_")) %>% pivot_longer(names_to = "topic", values_to = "weight", starts_with("topic_")) %>% replace_na(list(weight=0))
}


data.s <- prep_data(data)
## Joining, by = "day"
data.s

Looks good!

Graphing the data

Once again, I’ll do this as a function. Note, I’m going to shuffle the colors around here to help me see the topic boundaries. The default discrete palette is “hue ordered”, making hard to see where the boundaries are.

library(scales)
library(colorspace)
## Warning: package 'colorspace' was built under R version 3.6.2
plot_topics<-function(long_data) {
  num_topics = length(unique(long_data$topic))
  
  # Going to use a trick here to make sure I get distant colors next to one another
  cols <- hue_pal()(num_topics)
  half <- 1:ceiling(length(cols)/2)
  cols <-lighten(muted(as.vector(rbind(cols[half],cols[-half]))),.5)
  g<-ggplot(long_data)+geom_area(aes(x=day,y=weight,fill=topic)) + scale_fill_manual(values = cols)+guides(fill=guide_legend(ncol=2))
  return(g)
}
plot_topics(data.s)
## Warning in rbind(cols[half], cols[-half]): number of columns of result is not a
## multiple of vector length (arg 2)

Ok, so that sort of sucks, so we’ll do a little aggregation. I’m going to add a function here to bin the data. Also adding a normalization parameter if I want to look at proportions.

bin_data<-function(long_data,num_days,normalize = F) {
  d<- long_data%>% ungroup() %>% mutate(index = floor(as.numeric(day - min(day)) / num_days)) %>% group_by(index,topic) %>% summarise(weight = sum(weight),day = min(day)) %>% ungroup() %>% select(-index)
   if (normalize) {
    d %>% group_by(day) %>% mutate(weight = weight / sum(weight))-> d
  }
  return(d)
}

binned_data <- bin_data(data.s,7)
## `summarise()` has grouped output by 'index'. You can override using the `.groups` argument.
plot_topics(binned_data)+ggtitle("Binning by 7 days")
## Warning in rbind(cols[half], cols[-half]): number of columns of result is not a
## multiple of vector length (arg 2)

Ok, this is visually a bit jarring, but I can begin to see the individual topics. Let’s look at a few more.

binned_data <- bin_data(data.s,15)
## `summarise()` has grouped output by 'index'. You can override using the `.groups` argument.
plot_topics(binned_data)+ggtitle("Binning by 15 days")
## Warning in rbind(cols[half], cols[-half]): number of columns of result is not a
## multiple of vector length (arg 2)

binned_data <- bin_data(data.s,15,T)
## `summarise()` has grouped output by 'index'. You can override using the `.groups` argument.
plot_topics(binned_data)+ggtitle("Binning by 15 days,normalized")
## Warning in rbind(cols[half], cols[-half]): number of columns of result is not a
## multiple of vector length (arg 2)
## Warning: Removed 19 rows containing missing values (position_stack).

I notice some interesting variance in topic 5 and topic 3 in the early part of 2020, but otherwise, nothing tremendously useful. Might be nice to label the topics right on the graph, but we can do that later. See this stack over flow post.

Also, it occurs to me that I could smooth this out quite a bit by rolling a window over the data. I’m going to use RCppRoll, and I’ll use mean values rather than sums

roll_data<-function(long_data,win_size = 5, by = 1, normalize = F) {
  # To make life easier, I'm going to pivot my long data to wide
  wd<-pivot_wider(long_data,names_from = topic,values_from = weight) %>% arrange(day)
  rolled<-as_tibble(apply(wd %>% select(starts_with("topic_")),2,function(x) roll_mean(x,n = win_size,by = by)))

  win_ends <- roll_max(1:nrow(wd),n=win_size,by=by)

  rolled$day = wd$day[win_ends]
  r<-rolled %>% select(day,everything()) %>% pivot_longer(names_to = "topic", values_to = "weight", starts_with("topic_"))
  if (normalize) {
    r %>% group_by(day) %>% mutate(weight = weight / sum(weight))-> r
  }
  return(r)
  
}

roll_data(data.s,7,1)

Looks ok. Let’s try it out. Expect to see much smoother graph.

rolled_data <- roll_data(data.s,15,1)
plot_topics(rolled_data)+ggtitle("Rolling by 15 days")
## Warning in rbind(cols[half], cols[-half]): number of columns of result is not a
## multiple of vector length (arg 2)

Double checking - if we advance by 15 days at a time, this should look very similar to the binned data

rolled_data <- roll_data(data.s,15,15)
plot_topics(rolled_data)+ggtitle("Rolling by 15 days, delta = 15")
## Warning in rbind(cols[half], cols[-half]): number of columns of result is not a
## multiple of vector length (arg 2)

Great, finally, with normalization

rolled_data <- roll_data(data.s,15,1, T)
plot_topics(rolled_data)+ggtitle("Rolling by 15 days, delta = 1, normalized")
## Warning in rbind(cols[half], cols[-half]): number of columns of result is not a
## multiple of vector length (arg 2)
## Warning: Removed 361 rows containing missing values (position_stack).

Calculate Weighted Jaccards

Using the above, we’ll create a weighted jaccards function

weighted_jaccard<-function(x,y) {
  n<-sum(pmin(x,y))
  d<-sum(pmax(x,y))
  ifelse(d==0,0,n/d)
}

# Presume our data has already been binned / rolled
calc_topic_churn<-function(long_data) {
  long_data %>% group_by(topic) %>% arrange(day,.by_group = TRUE) %>% mutate(lagged_weights = lag(weight,1,order_by = day)) -> lagged_data
  #return(lagged_data)
  lagged_data %>% filter(!is.na(lagged_weights)) %>% group_by(day) %>% summarise(jaccard = weighted_jaccard(weight,lagged_weights))
}

calc_topic_churn(data.s)

Looks good, so checking plotting

ggplot(calc_topic_churn(data.s))+geom_line(aes(day,jaccard))

Now with binning

rolled_data<-roll_data(data.s,7,by=7)
ggplot(calc_topic_churn(rolled_data))+geom_line(aes(day,jaccard))+theme_minimal()+ylim(0,1)

Cosine similarity

We can do the same thing with cosine similarity.

cosine_similarity<-function(x,y) {
  if (length(x) != length(y)) {
    stop("x and y must be equal length vectors")
  }
  n = sum(x*y)
  d = sqrt(sum(x^2))*sqrt(sum(y^2))
  ifelse(d==0,0,n/d)
}

# Presume our data has already been binned / rolled
calc_cosine_similarity<-function(long_data) {
  long_data %>% group_by(topic) %>% arrange(day,.by_group = TRUE) %>% mutate(lagged_weights = lag(weight,1,order_by = day)) -> lagged_data
  #return(lagged_data)
  lagged_data %>% filter(!is.na(lagged_weights)) %>% group_by(day) %>% summarise(similarity = cosine_similarity(weight,lagged_weights))
}

calc_cosine_similarity(data.s)

Looks good. Plotting as before, comparing the two.

ggplot(calc_cosine_similarity(data.s))+geom_line(aes(day,similarity))+theme_minimal()+ylim(0,1)+ggtitle("Cosine similarity")

ggplot(calc_topic_churn(data.s))+geom_line(aes(day,jaccard))+theme_minimal()+ylim(0,1)+ggtitle("Jaccard")

rolled_data<-roll_data(data.s,7,7)

ggplot(calc_cosine_similarity(rolled_data))+geom_line(aes(day,similarity))+theme_minimal()+ylim(0,1)+ggtitle("Cosine similarity")

ggplot(calc_topic_churn(rolled_data))+geom_line(aes(day,jaccard))+theme_minimal()+ylim(0,1)+ggtitle("Jaccard")

Looking at entropy

One last potential measure here - we’ll have a look at entropy. Note that entropy is calculated within a window, rather than by comparing two windows. Also, entropy is not normalized.

entropy<-function(x,base=exp(1)) {
  p = x/sum(x)
  -sum(p*log(p,base))  
}

# Presume our data has already been binned / rolled
calc_entropy<-function(long_data) {
  long_data %>% group_by(day) %>% summarise(entropy = entropy(weight))
}

calc_entropy(data.s)
ggplot(calc_entropy(data.s))+geom_line(aes(day,entropy))+theme_minimal()+ggtitle("Entropy")
## Warning: Removed 54 row(s) containing missing values (geom_path).

I find this a little unintuitive though, so using the definition of skew from Introne & Goggins (2015)

skew<-function(x) {
  if (length(x)==0) {
    return(0)
  } else {
    p = x/sum(x)
    1 - exp(-sum(p*log(p)))/length(x)
  }
}

# Presume our data has already been binned / rolled
calc_skew<-function(long_data) {
  long_data %>% group_by(day) %>% summarise(skew = skew(weight))
}

calc_skew(data.s)
ggplot(calc_skew(data.s))+geom_line(aes(day,skew))+ylim(0,1)+theme_minimal()+ggtitle("Skew")
## Warning: Removed 54 row(s) containing missing values (geom_path).

Great. This indicates that there’s a pretty even balance here across the topics over time.

rolled_data<-roll_data(data.s,7,1)

ggplot(calc_skew(rolled_data))+geom_line(aes(day,skew))+ylim(0,1)+theme_minimal()+ggtitle("Skew")
## Warning: Removed 48 row(s) containing missing values (geom_path).